The construction of the Google Sailboat Building. (Source: https://i.redd.it/dh75zlho59o71.jpg)
The construction of the Google Sailboat Building. (Source: https://i.redd.it/dh75zlho59o71.jpg)

1. Introduction

Population – it’s a very important statistic to measure. And lately, we can see a huge incoming population to Texas, especially Austin. In fact, in 2020, the measured population was 965,827, with the estimated population in 2024 being 984,567, nearly a 2% increase in population.

The data was collection from HERE, the City of Austin Data Portal. By simply changing the Calendar Year Issued date to be equals 2023, I was able to retrieve the csv data.

From the data, I will be investigating the following:

  • The zip codes that appears most frequently

  • What kinds of work were done often

  • When were most projects approved

  • When the projects were completed

  • During what day of the week most projects take place

Just to name a few…

2. Data Preparation

I obtained the data by downloading the CSV file from the website.

Since the date columns were classified as chr, I had to turn them into Date columns.
The date columns were as follows: Applied.Data, Issued.Date, Expires.Date, and Completed.Date.

I attempted to change the Status.Date column to be of type Date, but was encountering some issues in conversions, so I left it as is. This means that when performing analyses, this column cannot be worked on. I will rather rely on the Status.Current column to analyze the projects.

After the re-factoring of columns, The only columns to have NA values are the Completed.Date and Applied.Date columns. These NA values are important as the Completed.Date NA’s signify that the project is yet to be completed or the project was withdrawn and the Applied.Date NA’s signify that no one applied for the project but was just rather issued to be done (which could be the government issuing its own project).

The column names didn’t have to be changed (as the names are already quite descriptive and the only changes would be to remove the .’s from the names).

3. Variable Descriptions

Type Description
Permit.Class.Mapped character Type of Permit
Work.Class character Type of Work
Applied.Date Date Permit Application Date
Issued.Date Date Permit Issuance Date
Day.Issued character Day of Week Permit was Issued
Calendar.Year.Issued integer Calendar Year Permit was Issued
Fiscal.Year.Issued integer Fiscal Year Permit was Issued
Status.Current character Current Status of Project
Expires.Date Date Expiration Date of Project
Completed.Date Date Completion Date of Project
Original.Address.1 character Address of Project
Original.Zip integer Zip Code of Project

4. Univariate Analyses

I chose the 4 following variables to analyze:

  • Work.Class

  • Day.Issued

  • Status.Current

  • Original.Zip

Work.Class

From the graph here, we can see that the types of work with the lowest number of occurrences are Remodel Mobile Home and Life Safety, with Relocation and Addition also having very few occurrences. The most amount of work happens when constructing something New or performing a Remodel.

Day.Issued

Seems like Thursday is the most popular day to issue a building permit, and Sunday is the least popular day. It is quite interesting to see that the weekends have the lowest number of occurrences, but it does make sense since the least amount of officials will be working on the weekends.

Status.Current

Analyzing the graph here, most of the projects that were approved did go to completion, as indicated by the large differences in occurrences between Final and the other statuses. A large amount of the approved permits (projects) remained Active or had Expired, which is interesting. Even though the permits were approved, some of the projects remained unfinished. This could be for several reasons.
1) It could be that the project rolled over and was completed in 2024, and thus remained active.
2) Some of the projects ended up needing more money or time necessary to be completed, therefore expiring.

Original.Zip

I think the analysis can be done pretty obviously here – just by looking at the graph, it is easy to say that most of the permits for construction were approved for the zip code 78704. 78704 is where Zilker Park, South Lamar, and St. Edward’s University are, making them high-population areas (just south of UT Austin). Because of its proximity to downtown Austin, it makes most sense for most building permits to be approved for that area. The second highest zip code occurrences happens in 78745, which is right below 78704.

5. Bivariate Analyses

I chose the following variable relationships to analyze:

  • Issued.Date - Applied.Date (Approval Time)

  • Expires.Date - Issued.Date (Project Length)

  • Approval Time vs. Project Length (Do longer projects take more approval time?)

  • Work.Class vs. Project Length (How does the type of project relate to given time?)

  • Work.Class vs. Day.Issued (How does the type of project relate to the day issued?)

  • Work.Class vs. Status.Current (How does the type of project relate to it’s completion?)

Approval Time

With the graph above, we can view time it took for the projects to be approved. Since the approval time for some of the projects took several years, there was data in the 15000s. This made the boxplot nearly unusable, and thus I chose to use log for convenient viewing.
However, we do need the real data, and so the summary of the data is below:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -183.0    13.0    61.0   192.7   194.0 15048.0       1

Notice how there’s a negative approval time? When I looked at the data the portal gave me from the website, it has it so that the issuing date is before the applied date, which makes no sense. So I dug a little deeper (going HERE), and doing a quick search of the address giving me the negative approval time gave me the correct information. The applied date was correct, but the issuing date was indeed wrong (which meant the day issued was also wrong). To fix this, I have manually change this.

With this, let’s revisit the distribution and summary:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -183.0    13.0    61.0   192.7   194.0 15048.0       1

And there we go, the summary now has a minimum of 0 days. It’s interesting to see that 75% of the approvals (as seen from the 3rd Quartile) takes less than 194 days, but there are those which take up to more than 15,000 days, which is more than 41 years!
Another interesting piece of data is the NA’s. For some reason, there is an address (if you want to know, 804 PARK BLVD) which had an Issued.Date but an NA for the Applied.Date. This is why we are getting the NA values in the summary.

Project Length

Similar to the last one, let’s view the summary:

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
## -14857.0     52.0    138.0    149.4    245.0    667.0     4276

Once again, we are getting the negative values.
But before we attack the negatives, there’s a column from the summary labelled NA's, which is interesting. This is derived due to some of the projects (in this case, 4276) not having a completed date (thus set to NA) but having an issued date. When subtracting values from NA, the result ends up being, you guessed it, NA. The projects with statuses Active, Expired, and VOID all had NA completed dates, with each having 2561, 1669, and 33, respectively. These total up to 4263, because Withdrawn, Pending Permit, and On Hold all have their amount of NA’s (which total up to 13), giving us the 4276.

Let’s filter the negative and NA values out and view the graph and data.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    56.0   141.0   165.7   246.0   667.0

Now we can visualize that the average time it took for a project to be completed was 165 (nearly 166) days, with the maximum being 667 days. More than 75% of the day differences are below 246 days.

For the sake of repetition, I will mutate the dataframe to have the above two as additional columns.

Approval Time vs. Project Length

Wait, the data looks awkward, right? How were some projects completed before they were even issued (as seen with the negative Project Lengths)? Digging some more into this, I went back to the record (the link I had to replace the error before), and it has the same information. To not spread misinformation, I will filter out the negative lengths and view the graph that way.

Looking at the graph, most of the cluster is below an Approval Time of 2500 days. But the question is how do these two relate? From previous analyses, we can know that the more horizontal/vertical a graph is, the less correlation the two variables are, and the graphs with a slope have a correlation. This graph is both horizontal and vertical, so let’s see how this turns out.

## [1] -0.04

Well, this graph shows a -0.04 correlation, which is a Weak Negative relationship. Anything with no correlation is bad enough, but this is worse! These two variables move in opposite directions - the longer the project length, the shorter the approval time (essentially, the graph is downward sloping rather than upward sloping). Thing is, it would be vital if the correlation was more negative than it is, but with -0.04, that’s basically nothing. In the end, it just means that these two variables have very little effect on each other, but the general trend is somewhat (very, very slightly) negative sloping.

Work.Class vs. Project Length

The type of work that had the largest median project length was Shell. Shell is a construction term used to describe the exterior and the core of the building (which would be walls, roofing, foundation, windows, etc.). In contrast, Life Safety had the lowest median project length. Addition and Remodel had the largest project length (which would be 667, as observed from the Project Lengths analysis). New had the second largest max, with Repair coming in third and Remodel coming in fourth. Interestingly, Life Safety, alongside having the lowest median project length (and maximum project length for that matter), also has the lowest number of occurrences. New has the most amount of occurrences.

Work.Class vs. Day.Issued

For the weekdays (M, T, W, Th, F), we can see that the New category leads in occurrences. On the weekends (Sat, Sun), due to the lower quantity of work being done, it’s harder to see which one really takes the lead, but on a high level, New takes the lead for Saturday and Addition and Remodel takes the lead on Sunday.
One patterns of types of work is similar on Mondays, Tuesdays, and Fridays while another is similar on Wednesdays and Thursdays.
The analysis we can perform here is simple: more work is assigned on weekdays than weekends. Since the pattern is quite similar for all weekdays, there would be no correlation between the type of work done and the day the project is issued on.

Work.Class vs. Status.Current

Here, we can see that a majority of the projects did end up going to completion, but even if the New column had the most amount of completed projects, it also had the largest amount of active projects. Repair had the largest amount of expired projects, and Remodel had the largest amount of withdrawn projects.

Seeing as how interesting it was to analyze the graphs before, let’s try to create a MULTIVARIATE plot, which has 3 variables compared against each other.

Above is the relationship between three variables: Work.Class, Status.Current, and Day.Issued. As seen, a majority of the projects were finalized, but a large proportion of them for each day remained active. Here, we are basically looking at the same data as the sixth analysis (Type of Work vs. Completion Status), but now we’ve split them up into days. Nevertheless, the pattern remains the same: the weekdays get a majority of the projects, with New being the highest for them all and the patterns of the data heights matching with other days.

6. Reproducibility Resources

Link to GitHub -> {GitHub}
The repository contains the Project2.Rmd, Project2.html, and Issued_Construction_permits_20241102.csv files.
The Project2.Rmd file contains the code for the project.
The Project2.html file contains the html view of the base project. This file will not be modified until the code in the Rmd is changed and the file is knit.
The Issued_Construction_permits_20241102.csv file is the csv that contains the data. This is directly what the permits dataframe reads into.

7. Choice Elements

The choice elements I have are A, D, E, G, and I.

A

A.1 - Line 238 - Getting the count of Active occurrences in Status.Current
A.2 - Line 238 - Getting the count of Expired occurrences in Status.Current
A.3 - Line 238 - Getting the count of VOID occurrences in Status.Current
A.4 - Line 294 - Getting the correlation between Project Length and Approval Time

D

D.1 - Line 258 - Mutated the dataframe with three new columns: IssuedDayDiff, ExpiredDayDiff, and ProjectLength

E

E.1 - Line 222 - Explaining how the NA values in the Approval Time Lengths came from an NA value in the Applied.Date column
E.2 - Line 238 - Explaining how the NA values in the Project Lengths were a sum of Active, Expired, VOID, and other statuses

G

G.1 - Line 052 - Converting the Applied.Date column to be of type Date, not character
G.2 - Line 053 - Converting the Issued.Date column to be of type Date, not character
G.3 - Line 054 - Converting the Expires.Date column to be of type Date, not character
G.4 - Line 055 - Converting the Completed.Date column to be of type Date, not character

I

I.1 - Line 347 - The multivariate plot containing the relationship between Work.Class, Status.Current, and Day.Issued

8. Conclusion

Throughout the project, we have done several analyses. And we found a lot of information too. For starters, we know the government has a lot of wrong information in their datasets, which is just hilarious. But jokes aside, we did gain some vital findings related to my prompt. A majority of the building permits issued in 2023 were New and Shell construction, which means more houses and more apartments were being built to house a larger number of incoming people. Most of the projects that were issued were also applied for in 2023, which goes to show the efficiency the construction workers had to be working at in order for the project to be completed. We found that while there was nearly no relationship between the time length of a project and the time it took for the project to be approved (issued), there is somewhat of a relationship between the type of project the work is and the time length of said project, which makes sense considering building a new home would take longer than adding life safety features to buildings. We found that the Zip Codes which had a majority of the projects were in downtown, indicating large amounts of growth within the city and the nearby suburban areas. While this project didn’t allow me to truly see how the population varied throughout the year, it provided key information to proving that the incoming population was large.

9. References

https://i.redd.it/dh75zlho59o71.jpg
Image used for Project

https://data.austintexas.gov/Building-and-Development/Building-Permits-Issued-in-2015/ayav-qckq
Data collection portal

https://data.austintexas.gov/Building-and-Development/Issued-Construction-Permits/3syk-w9eu/data_preview
Larger data portal, with information from all years (backup source for information correction)